Akioyamen, Tang, and Hussien (2021) - Western University
2025-04-17
I will focus on the second goal.
Many regime switching models have been widely used in the finance literature.
Most are supervised, i.e. the regimes are chosen either deterministically or to minimize the distance between the fitted model and the (labelled) data. For example:
The authors claim that the use of unsupervised learning is more limited, and they intend to fill that gap.
I will focus on securities, while balancing different areas of the economy:
New window: 03/jan/2007 to 09/apr/2025, training up to 04/dec/2020. 4826 obs.
The authors use the percentage change of all the variables. This is common in finance: it yields a returns interpretation and helps with stationarity.
But some variables carry information in their levels: inflation expectations, recession indicators, FED volume, etc. The subset of those that passed an ADF test was kept untransformed.
I also imputed missing values for variables with few of them, and dropped the variables that had long periods of missing values. The final dataset has 96 variables.
In PCA the variables are standardized, and the SVD decomposition yields components that are orthogonal to each other.
For standardization, the variables need to be stationary. I considered PCA for nonstationary data, but most variables with important levels were already stationary.
The paper used 26 components to explain 90% of the variance. In the replication, with double the variables, 32 were used.
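As a hedged sketch (toy data, not the 96-variable dataset), the standardize, decompose, and truncate-at-90% step could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the dataset: 500 observations of 96 correlated variables.
X = rng.standard_normal((500, 96)) @ rng.standard_normal((96, 96))

# Standardize each variable, then take the SVD (equivalent to PCA).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Keep the smallest number of components explaining >= 90% of the variance.
explained = np.cumsum(s**2) / np.sum(s**2)
n_components = int(np.searchsorted(explained, 0.90) + 1)
scores = Z @ Vt[:n_components].T  # principal-component scores
```

The `scores` matrix is what the clustering and classification steps below would consume.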
The regime detection is done by K-Means. Let the data be the sequence of observations \((x_t)_{t = 1}^T\), with \(d\) variables, i.e. \(x_t \in \mathbb{R}^d,~ \forall t\).
Given a value of \(k \in \mathbb{N}\), the goal of K-Means is to choose the sets \(S = \{S_1, \dots S_k\}\) that minimizes the within-cluster sum of squares:
\[ \text{argmin}_{S} \sum_{i = 1}^k \sum_{x \in S_i} \Vert x - \mu_i \Vert^2, ~~~ \mu_i = \frac{1}{|S_i|}\sum_{x \in S_i} x \]
Besides \(k\), the functional form of the centroid \(\mu_i\) and the distance measure \(\Vert . \Vert\) are hyperparameters.
The authors use the mean and the \(L^2\) norm, as is standard.
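A minimal sketch of this setup, using scikit-learn's KMeans on synthetic data (the blobs below are illustrative, not the paper's regimes):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two synthetic "regimes" in a 2-D feature space (illustrative only).
calm = rng.normal(loc=0.0, scale=0.5, size=(300, 2))
crisis = rng.normal(loc=3.0, scale=1.0, size=(100, 2))
X = np.vstack([calm, crisis])

# k = 2, Euclidean (L2) distance, mean centroids -- the standard setup.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_              # cluster (regime) assignment per observation
centroids = km.cluster_centers_  # the mu_i in the objective above
```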
There are different methods to validate the choice of \(k\). The authors use the average silhouette width method.
Fix a data point \(x \in S_i\). Let \(a(x)\) be the average distance between \(x\) and the other points of its cluster, and \(b(x)\) the average distance to the points of the nearest other cluster:
\[ a(x) = \frac{1}{|S_i| - 1}\sum_{y \in S_i,~ x \neq y} d(x, y), ~~~~ b(x) = \min_{S_j \neq S_i} \frac{1}{|S_j|}\sum_{y \in S_j} d(x, y) \]
The silhouette of \(x\) is defined as:
\[ s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} \]
Observations with low silhouette are not very well separated.
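The average-silhouette selection of \(k\) can be sketched as follows (toy clusters, with silhouette_score computing the mean \(s(x)\)):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Two well-separated toy clusters, so the average silhouette should peak at k = 2.
X = np.vstack([rng.normal(0.0, 0.4, (200, 2)), rng.normal(4.0, 0.4, (150, 2))])

widths = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    widths[k] = silhouette_score(X, labels)  # average s(x) over all points
best_k = max(widths, key=widths.get)
```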
Note that, in the paper, \(k = 2\) is convenient, since \(k > 2\) would require multinomial classification methods.
The silhouette plot for the original paper and the replication is shown below. Both have their maxima at \(k = 2\).
The authors analyze the first two principal components, colored by regime (next slide). Their interpretation:
The first step was an unsupervised definition of the regimes via clustering.
The second step is a supervised approach to predict the regimes, using the clusters as labels.
The K-Means algorithm itself can classify new data, assigning the cluster of the closest centroid. But the authors' approach is agnostic to the clustering method.
Additionally, the classification method can be tuned in alignment with the portfolio strategy.
Given an observation, let \(y \in \{0, 1\}\) denote its regime and \(\pmb{x}\) its covariates.
The logit model predicts the probability of \(y = 1\) given \(\pmb{x}\) as:
\[ P(y = 1 \mid \pmb{x}) = f(\beta_0 + \pmb{x}'\beta) \]
Where \(f\) is the logistic function. The parameters \(\beta_0, \beta\) are estimated by maximum likelihood. The predicted class is the rounded probability.
In R, the method is implemented via glm(), following the standard GLM framework.
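The replication uses R's glm(); a hedged Python equivalent with scikit-learn's maximum-likelihood logistic fit, on synthetic scores and labels, might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Synthetic component scores and regime labels (illustrative only).
X = rng.standard_normal((400, 5))
true_beta = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

# Maximum-likelihood logistic fit; predicted class is the rounded probability.
logit = LogisticRegression(max_iter=1000).fit(X, y)
prob = logit.predict_proba(X)[:, 1]
pred = (prob >= 0.5).astype(int)
```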
Assume the data are Gaussian within each class. Estimate \(\mu_k = E[\pmb{x}|y = k]\) and \(\Sigma_k = \text{Cov}(\pmb{x}|y = k)\).
Then, we can use the gaussian PDF to find the likelihood of \(y = 0\) and \(y = 1\).
Quadratic discriminant analysis classifies \(y = 1\) if the likelihood ratio \(L(y=1)/L(y=0)\) is larger than some threshold \(t\). Geometrically, it can be shown that this is equivalent to a quadratic surface separating the classes.
Linear discriminant analysis assumes \(\Sigma_0 = \Sigma_1\), which simplifies the problem into a linear surface.
Behind the scenes, the model searches for the projection in which the projected class means are most separated, \(|\mu_1 - \mu_0|\), normalized by the sum of the within-class variances.
Graph by Dr. Guangliang Chen
The authors do not apply Bartlett's test to check whether the covariance matrices are equal, i.e. whether LDA or QDA is more appropriate. Doing so, LDA seems to be sufficiently appropriate.
In R, the method is implemented in the MASS package, following Venables and Ripley (2002).
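The replication uses MASS's lda()/qda(); a hedged scikit-learn sketch on synthetic Gaussian classes (with a shared covariance, so LDA's assumption holds) could be:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(4)
# Two Gaussian classes with a shared covariance matrix (illustrative data).
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, 300)
X1 = rng.multivariate_normal([2.0, 2.0], cov, 300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)     # pooled Sigma -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class Sigma_k -> quadratic
```

When the covariances truly are equal, the two fits should classify almost identically, which is what Bartlett's test is probing for.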
Decision trees are a combination of nodes \(j \in \mathbb{J}\). At each node, the data is partitioned into subsets via rules of the form \(x_{s_j} \leq c_j\), with \(x_{s_j} \in \pmb{x}\).
Let \(m \in \mathbb{T}\) denote the terminal node reached by \(\pmb{x}\), and \(\mathcal{R}_k\) denote the set of terminal nodes related to \(y = k\). Then:
\[ y = \sum_{k = 0}^1\beta_k I\{\pmb{x} \in \mathcal{R}_k\} + e_t = \sum_{k = 0}^1\beta_k \prod_{j \in \mathbb{J}_m}I\{x_{s_j} \leq c_j\} + e_t \]
Where \(\mathbb{J}_m\) is the set of nodes leading to \(m\). The parameters \(c_j, s_j\) are estimated by minimizing the sum of squared residuals, split by split, until the improvement in SSR falls below a desired value.
In R, the method is implemented in the rpart package, following Breiman et al. (1984).
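The replication uses rpart; a hedged scikit-learn sketch, where `min_impurity_decrease` plays the role of the stopping threshold above (toy data, a noise-free two-rule "regime"):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, (500, 2))
# A noise-free "regime" defined by two threshold rules of the form x_s <= c.
y = ((X[:, 0] <= 0.2) & (X[:, 1] <= 0.5)).astype(int)

# Splitting stops once a further split no longer improves the fit enough.
tree = DecisionTreeClassifier(min_impurity_decrease=1e-3, random_state=0).fit(X, y)
```

Because the target needs both features, the fitted tree must be at least two levels deep to recover the rule.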
AdaBoost is a boosting algorithm for weak classifiers. The authors do not specify the weak learner, but I assumed decision trees.
The ensemble to be formed, with \(T\) iterations, is \(H_T(\pmb{x}) = \sum_{t = 1}^T \alpha_t h_t(\pmb{x})\).
It is built iteratively, updating (correcting) the fits of the last iterations at each step: \(H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)\)
The weak learner \(h_t\) and the parameter \(\alpha_t\) are selected at each iteration to minimize the training error.
In R, the method is implemented in the JOUSBoost package, following Freund and Schapire (1997).
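The replication uses JOUSBoost; a hedged scikit-learn sketch, where depth-1 trees (stumps) are the default weak learner \(h_t\) (synthetic data, not the paper's):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(6)
X = rng.standard_normal((400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a boundary no single stump can match

# Each iteration t adds alpha_t * h_t, with h_t fit on data reweighted
# toward the observations the previous ensemble got wrong.
ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
```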
Naive Bayes assumes that the features are conditionally independent within each class, i.e. no information is shared between the predictors. This is unrealistic, but it makes estimation simple.
A prior for \(p(y = k)\) is set. The authors did not specify it, but I assumed the empirical class frequencies, \(\frac{1}{T}\sum_{t} I\{y_t = k\}\).
Then, this prior is updated via Bayes' theorem: \(p(y = k|\pmb{x}) \propto p(\pmb{x}| y = k)\, p(y = k)\).
The authors did not specify, but I assumed a Gaussian distribution for the likelihood \(p(\pmb{x}| y = k)\). This can fit the percent-change variables well, but maybe not the variables kept in levels; the untransformed data possibly should be handled differently.
The model then predicts \(y\) via:
\[ \hat{y} = \text{argmax}_{y}~ p(y) \prod_{i} p(x_i|y) \]
In R, the method is implemented in the naive_bayes package.
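A hedged scikit-learn sketch of the same assumptions (Gaussian per-feature likelihood, empirical class frequencies as the prior), on toy data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(7)
# Two toy classes; features are treated as independent Gaussians within each class.
X0 = rng.normal([0.0, 0.0], [1.0, 1.0], (300, 2))
X1 = rng.normal([2.0, 1.0], [1.0, 1.5], (300, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

# Priors default to the empirical class frequencies; likelihood is Gaussian.
nb = GaussianNB().fit(X, y)
posterior = nb.predict_proba(X)  # proportional to p(x|y) * p(y), normalized
```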
The models are fitted with the selected principal components for the training window.
10-fold cross-validation is used to assess model performance. The authors present the in-sample results, which is not ideal.
Let TP, TN, FP, FN be the number of true positives, etc. Let \(\pi_0\) be the threshold probability that defines \(y = 1\). The considered metrics were:
The authors later look for which metric is most relevant for the portfolio strategy, but could have considered more metrics, especially the more elementary ones.
The regimes are also unbalanced, so the model is more likely to predict the majority class, and the metrics become less informative.
In the replication, I calculated the out-of-sample metrics for the k-fold cross-validation (not the test set).
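A hedged sketch of that out-of-fold evaluation, using scikit-learn's cross_val_predict with a logit model on imbalanced synthetic regimes (illustrative data only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(8)
# Imbalanced synthetic regimes (illustrative, not the paper's data).
X = rng.standard_normal((300, 4))
p = 1 / (1 + np.exp(-(X @ np.array([2.0, -1.0, 0.5, 0.0]) - 1.5)))
y = rng.binomial(1, p)

# Out-of-fold predictions from 10-fold CV; metrics computed on these
# predictions are out-of-sample, unlike in-sample tables.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)
acc = accuracy_score(y, pred)
prec = precision_score(y, pred, zero_division=0)
rec = recall_score(y, pred, zero_division=0)
```

Stratified folds keep the class imbalance comparable across folds, which matters here since the minority regime is the one of interest.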
The paper considers three strategies:
As the LDA model is the strongest contender, I’ll focus on its results.
The performance metrics for the original paper and replication can be seen below.
The comparison of the strategies for the S&P 500 is shown below. In the replication, I added security and cluster information.
The comparison of the strategies for oil is shown below. In the replication, I added security and cluster information.
The performance metrics for the original paper and replication can be seen below.
Original Paper
Replication
The performance metrics for the original paper and replication can be seen below.
The paper uses an interesting mix of models in its hybrid approach, yielding good results of regime detection and classification.
There were several poor practices, and modelling decisions that could have been better explained. I tried to partially fill that gap in the replication.
Some strategies, especially tail-hedging, presented very different results, which could indicate inconsistency.
Good coding and data setup; more novelty in the replication would be nice:
Thank You! Questions?